-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bring CUDA support to Tracking.jl
#33
Conversation
Update forked master due to hotfix
Incorporate the hotfix from the master repo
…com/ozmaden/Tracking.jl into feature/gpu-accelerated-correlation
Compat LoopVectorization
Compat LoopVectorization
Codecov Report
@@ Coverage Diff @@
## master #33 +/- ##
==========================================
- Coverage 90.59% 84.77% -5.82%
==========================================
Files 16 17 +1
Lines 404 473 +69
==========================================
+ Hits 366 401 +35
- Misses 38 72 +34
Continue to review full report at Codecov.
|
A comment on the last commit 90358a9. It checks inside the julia> signal_cuarray
2500-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
julia> results_gpu = track(signal_cuarray, state_gpu, sampling_frequency)
ERROR: ArgumentError: Signal is not a StructArray, initialize the signal properly and try again.
julia> fake_struct = StructArray(signal_cpu)
2500-element StructArray(::Vector{Float32}, ::Vector{Float32}) with eltype ComplexF32:
julia> results_gpu = track(fake_struct, state_gpu, sampling_frequency)
ERROR: ArgumentError: Signal and GNSS codes are not of the same type. Please check if CPU or GPU is used. It comes with a slight penatly on the performance in the GPU case, but I think it's manageable. #BEFORE THE COMMIT
julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 81.162 μs … 26.223 ms ┊ GC (min … max): 0.00% … 41.79%
Time (median): 84.806 μs ┊ GC (median): 0.00%
Time (mean ± σ): 98.184 μs ± 439.458 μs ┊ GC (mean ± σ): 3.16% ± 0.71%
█▄
▅██▅▄▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂ ▂
81.2 μs Histogram: frequency by time 177 μs <
Memory estimate: 29.80 KiB, allocs estimate: 418.
#AFTER THE COMMIT
julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 83.498 μs … 26.311 ms ┊ GC (min … max): 0.00% … 40.27%
Time (median): 89.555 μs ┊ GC (median): 0.00%
Time (mean ± σ): 111.975 μs ± 519.990 μs ┊ GC (mean ± σ): 3.66% ± 0.80%
█▇▄▂▁▁▁▁ ▂
███████████▇▇▇▆▄▆▆▅▆▆▇▅▄▄▅▄▄▅▄▃▃▅▄▃▃▃▁▄▁▄▃▃▄▁▁▃▃▁▃▄▁▃▅▄▄▅▄▃▄▄ █
83.5 μs Histogram: log(frequency) by time 460 μs <
Memory estimate: 29.80 KiB, allocs estimate: 418. No regression for the CPU: #BEFORE THE COMMIT
julia> @benchmark track($signal_cpu, $state_cpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
Range (min … max): 8.174 μs … 2.401 ms ┊ GC (min … max): 0.00% … 99.11%
Time (median): 8.553 μs ┊ GC (median): 0.00%
Time (mean ± σ): 9.361 μs ± 32.195 μs ┊ GC (mean ± σ): 4.84% ± 1.40%
▂█▅
▃████▅▅▄▃▃▄▄▄▄▅▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▁▂▁▁▁▂▂▂▂ ▃
8.17 μs Histogram: frequency by time 13.1 μs <
Memory estimate: 5.12 KiB, allocs estimate: 94.
#AFTER THE COMMIT
julia> @benchmark track($signal_cpu, $state_cpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
Range (min … max): 7.917 μs … 2.161 ms ┊ GC (min … max): 0.00% … 99.45%
Time (median): 8.399 μs ┊ GC (median): 0.00%
Time (mean ± σ): 9.155 μs ± 30.340 μs ┊ GC (mean ± σ): 4.67% ± 1.40%
██▆▁
▂▇████▅▄▄▅▅▅▆▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▁▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
7.92 μs Histogram: frequency by time 13.6 μs <
Memory estimate: 5.12 KiB, allocs estimate: 94. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some minor changes. Otherwise looks good!
src/carrier_replica.jl
Outdated
Floating point CPU StructArray carrier generation | ||
""" | ||
function gen_carrier_replica!( | ||
carrier_replica::StructArray{Complex{T},1,NamedTuple{(:re, :im),Tuple{Array{T,1},Array{T,1}}},Int64}, | ||
carrier_frequency, | ||
sampling_frequency, | ||
start_phase, | ||
carrier_amplitude_power::Val{N}, | ||
start_sample, | ||
num_samples | ||
) where { | ||
T <: AbstractFloat, | ||
N | ||
} | ||
sample_range = start_sample:num_samples + start_sample - 1 | ||
@views @. carrier_replica.re[sample_range] = 2pi * (sample_range) * carrier_frequency / sampling_frequency + start_phase | ||
@. carrier_replica.im[sample_range] = sin(carrier_replica.re[sample_range]) | ||
@. carrier_replica.re[sample_range] = cos(carrier_replica.re[sample_range]) | ||
return carrier_replica | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is a left over from previous tests. I think this can be removed.
src/correlator.jl
Outdated
@@ -124,7 +124,7 @@ Get prompt correlator | |||
function get_prompt(correlator::AbstractCorrelator, correlator_sample_shifts) | |||
correlator.accumulators[get_prompt_index(correlator_sample_shifts)] | |||
end | |||
|
|||
CUDA.dot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's that?
src/tracking_state.jl
Outdated
It is determined based on the signal, which is given in the track | ||
function. | ||
""" | ||
struct CarrierReplicaGPU{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CarrierReplicaGPU is never used, isn't it? I guess we can get rid of this structure.
src/tracking_state.jl
Outdated
It is determined based on the signal, which is given in the track | ||
function. | ||
""" | ||
struct DownconvertedSignalGPU{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DownconvertedSignalGPU is never used, isn't it? I guess we can get rid of this structure.
src/tracking_state.jl
Outdated
DS <: DownconvertedSignalCPU, # Union{DownconvertedSignalCPU, CuArray{ComplexF32}} | ||
CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}} | ||
COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}} | ||
DS <: Union{DownconvertedSignalCPU, DownconvertedSignalGPU}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can leave DownconvertedSignalCPU
even for the GPU case and initialize that as an empty array.
src/tracking_state.jl
Outdated
CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}} | ||
COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}} | ||
DS <: Union{DownconvertedSignalCPU, DownconvertedSignalGPU}, | ||
CAR <: Union{CarrierReplicaCPU, CarrierReplicaGPU}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can leave CarrierReplicaCPU
even for the GPU case and initialize that as an empty array.
@zsoerenm Thanks for the feedback. ccb75f0 deals with the first two comments, the others are handeled by 2a24f7d. The latter even improves the performance: Before 2a24f7d julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 82.073 μs … 22.217 ms ┊ GC (min … max): 0.00% … 38.81%
Time (median): 86.151 μs ┊ GC (median): 0.00%
Time (mean ± σ): 103.747 μs ± 487.141 μs ┊ GC (mean ± σ): 3.95% ± 0.84%
▆█▇▄▄▅▄▃▂▁ ▂
██████████▇▆▆▇▇█▇▆▇▇▇▆▇▇▆▆▅▅▅▅▅▅▅▅▅▆▄▄▄▅▃▅▅▅▃▁▁▁▄▄▁▁▃▄▁▁▄▃▄▅▄ █
82.1 μs Histogram: log(frequency) by time 211 μs <
Memory estimate: 29.80 KiB, allocs estimate: 418. After 2a24f7d julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 64.100 μs … 21.582 ms ┊ GC (min … max): 0.00% … 38.88%
Time (median): 66.980 μs ┊ GC (median): 0.00%
Time (mean ± σ): 82.030 μs ± 475.433 μs ┊ GC (mean ± σ): 4.97% ± 0.86%
██
▄██▇▃▃▃▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂ ▂
64.1 μs Histogram: frequency by time 134 μs <
Memory estimate: 29.33 KiB, allocs estimate: 406. |
src/downconvert_and_correlate.jl
Outdated
if sample_idx <= num_samples && antenna_idx <= num_ants && corr_idx <= num_corrs | ||
# generate carrier | ||
carrier_im[sample_idx], carrier_re[sample_idx] = CUDA.sincos(2π * ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what purpose do you save the sincos result in a vector? Is it for performance improvements in the future? If this is the case, let's dismiss that here to have a clean baseline.
src/tracking_state.jl
Outdated
@@ -51,6 +50,7 @@ function DownconvertedSignalCPU(num_ants::NumAnts{N}) where N | |||
) | |||
end | |||
|
|||
@inline gen_blank_carrier_gpu(system::S, num_samples) where {CO <: CuMatrix,S <: AbstractGNSS{CO}} = StructArray{ComplexF32}((CuArray{Float32}(undef, num_samples), CuArray{Float32}(undef, num_samples))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be dismissed, if you do not save the sincos result in a vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't used anymore, is it? So I think you can rid of this, too.
src/tracking_state.jl
Outdated
CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}} | ||
COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}} | ||
DS <: Union{DownconvertedSignalCPU, Nothing}, | ||
CAR <: Union{CarrierReplicaCPU, StructVector{ComplexF32, NamedTuple{(:re, :im), Tuple{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Int64}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
@zsoerenm I have implemented your proposal. Now that I thought about it, I too decided it made more sense to just leave everything at Before eab881c: julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 65.749 μs … 24.988 ms ┊ GC (min … max): 0.00% … 46.31%
Time (median): 69.804 μs ┊ GC (median): 0.00%
Time (mean ± σ): 92.254 μs ± 509.974 μs ┊ GC (mean ± σ): 4.92% ± 0.90%
█▆▅▃▂▁▁▁▁ ▁
██████████▇▆▆▄▄▆▆▅▅▄▄▅▄▆▄▅▄▃▃▅▃▃▄▅▅▅▃▄▄▅▄▅▆▅▃▅▁▅▅▅▄▄▃▃▁▁▅▄▄▄ █
65.7 μs Histogram: log(frequency) by time 381 μs <
Memory estimate: 29.33 KiB, allocs estimate: 406. After eab881c: julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 65.857 μs … 25.658 ms ┊ GC (min … max): 0.00% … 33.48%
Time (median): 69.429 μs ┊ GC (median): 0.00%
Time (mean ± σ): 87.554 μs ± 426.402 μs ┊ GC (mean ± σ): 3.05% ± 0.63%
█▆▄▃▁▁▁▁▁▁ ▁
███████████▆▆▆▇▅▆▅▅▄▅▅▄▅▅▆▄▄▅▄▄▁▁▄▄▄▅▃▁▄▄▃▃▃▅▅▄▄▅▃▄▅▄▄▄▄▄▄▄▃ █
65.9 μs Histogram: log(frequency) by time 366 μs <
Memory estimate: 26.23 KiB, allocs estimate: 388. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, looks good.
Just 2 minor code smells: One is duplicate code and the other is a function, that isn't used anymore.
BTW: I'm surprised that writing to a CuArray in the kernel does not generate an overhead.
src/tracking_loop.jl
Outdated
function choose(replica::Nothing, signal::AbstractArray) | ||
nothing | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those 3 lines are exactly the same as 370-372. So I think we can get rid of this.
src/tracking_state.jl
Outdated
@@ -51,6 +50,7 @@ function DownconvertedSignalCPU(num_ants::NumAnts{N}) where N | |||
) | |||
end | |||
|
|||
@inline gen_blank_carrier_gpu(system::S, num_samples) where {CO <: CuMatrix,S <: AbstractGNSS{CO}} = StructArray{ComplexF32}((CuArray{Float32}(undef, num_samples), CuArray{Float32}(undef, num_samples))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't used anymore, is it? So I think you can rid of this, too.
I have removed the mentioned code smell.
Yeah, strange. Wonder why that is, I have run the benchmark in multiple independent sessions just to be sure. The results stay the same. |
This brings (the long awaited) GPU functionality to
Tracking.jl
. At the moment onlyCUDA.jl
.Notes / Features:
CUDA.jl
kernel combining carrier and code replication, downconversion, and correlation for arbitrarynum_samples
,num_ants
,num_correlators
. (arbitrary as long as the hardware allows)CUDA.functional()
)README.md
with a GPU exampleREADME.md
for correct usage after Speculative: Move PRN to TrackingState #31This is a first fully functioning version. Optimizations regarding the performance of the CUDA kernel will follow. This version will be considered a baseline.